---
title: "Data Visualization to End Police Violence",
subtitle: "Replication Code no. 2"
author: "Tim Fraser, Northeastern University"
output: html_notebook
---

This code document describes how to replicate the figure "People of Color are nearly 2 times as likely to be fatally shot by the police than White peers," posted on Facebook on June 10.

# 0. Packages

```{r, message = FALSE, warning = FALSE}
library("BaylorEdPsych")
library(tidyverse)
library(car)
```

# 1. Gather data and impute missing data

```{r, message= FALSE, warning = FALSE}
# One key problem with law enforcement today is that Black Americans 
# are far more likely to be shot by the police than are White Americans.
# What if we could model that likelihood, and show the effect of key interventions?
# Get mode
mode = function(x){names(sort(table(x), decreasing = TRUE))[1]}


dat <- read_csv("https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv") %>%
  mutate(shot_while_poc = if_else(race %in% c("B", "A", "N", "H", "O"), 1, 0),
         shot_while_white = if_else(race %in% c("W"), 1, 0)) %>%
  
  # Recode race into readable categories
  mutate(race = race %>% dplyr::recode(
    "W" = "White, non-Hispanic",
    "B" = "Black, non-Hispanic",
    "A" = "Asian",
    "N" = "Native American",
    "H" = "Hispanic",
    "O" = "Other")) %>%
  # WAPO recorded what the victims were armed with upon death.
  # If they were armed with something that was not a weapon,
  # but instead a power tool, we'll categorized it as such:
  # Household Tool
  mutate(armed = if_else(armed %in% c("scissors", "flashlight", "hammer", "shovel", "pen",
                                      "stapler", "beer bottle", "contractor's level", "barstool",
                                      "screwdriver", "lawn mower blade", "straight edge razor",
                                      "garden tool", "carjack", "walking stick", "flagpole", "chair",
                                      "metal rake", "air conditioner", "wasp spray"),
                         true = "household tool", false = armed, missing = NA_character_)) %>%
  # Unarmed
  mutate(armed = if_else(armed %in% c("toy weapon", "unarmed", "undetermined", "pepper spray"),
                         true = "unarmed", false = armed, missing = NA_character_)) %>%
  # Unknown weapon
  mutate(armed = if_else(armed %in% c("unknown weapon", "claimed to be armed", 
                                      "metal stick", "metal hand tool", "metal object",
                                      "sharp object", "blunt object"),
                         true = "unknown weapon", false = armed, missing = NA_character_)) %>%
  # Firearms and explosives
  mutate(armed = if_else(armed %in% c("gun", "gun and sword", "BB gun", "BB gun and vehicle","airsoft pistol",
                                      "pellet gun","hachet and gun",
                                      "nail gun",  "gun and sword", "pellet gun", 
                                      "guns and explosives","machete and gun", "gun and car",
                                      "bean-bag gun", "gun and knife", "hatchet and gun", 
                                      "vehicle and gun",
                                      "air pistol", "Airsoft pistol",
                                      "grenade", "incendiary device"),
                         true = "firearms and explosives", false = armed, missing = NA_character_)) %>%
  # Potential Weapon
  mutate(armed = if_else(armed %in% c("knife", "sword", "machete",
                                      "car, knife and mace", "baseball bat and knife",
                                      "ice pick", "bayonet", "hachet and gun", "Taser",
                                      "baseball bat and fireplace poker", "hatchet",
                                      "pick-axe", "baton", "pitchfork", "glass shard",
                                      "vehicle and machete", "samurai sword", "fireworks",
                                      "ax", "hand torch", "pole and knife", 
                                      "wrench", "hachet", "boxcutter", "pole", "vehicle",
                                      "spear", "rock", "piece of wood", "pipe", "box cutter",
                                      "tire iron", 
                                      "motorcycle", "crowbar", "oar", "chainsaw", "baseball bat and bottle",
                                       "cordless drill", "crossbow", "bow and arrow", "gun and vehicle",
                                      "metal pole", "metal pipe", "meat cleaver", "chain",
                                      "bean-bag gun",  "brick", "baseball bat",
                                      "chain saw"),
                         true = "potential weapon", false = armed, missing = NA_character_)) %>%
  # Recode as fleeing, not fleeing, or NA
  mutate(flee = if_else(flee %in% c("Car", "Foot", "Other"), "Fleeing", flee)) %>%
  mutate(threat_level = if_else(threat_level == "attack", "attack", "not attack"))
  

# Now create data for modeling:
dat2 <- dat %>%
  # Convert catgories to character, so that we can then..
  mutate_at(vars(armed, gender, threat_level, flee, body_camera), funs(as.character(.))) %>%
  # Impute missing data with the mode as a character input
  mutate_at(vars(armed, gender, threat_level, flee, body_camera),
            funs(if_else(!is.na(.), ., mode(.))))  %>%
  # Next, I set the baseline category for most categorical variables 
  # to be the model category, so we can describe most shootings and control for the difference
  mutate(
    threat_level = threat_level %>% factor %>% relevel(ref = "attack"),
    signs_of_mental_illness = signs_of_mental_illness %>% factor %>% relevel(ref = "FALSE"),
    armed = armed %>% factor %>% relevel(ref = "firearms and explosives"),
    body_camera = body_camera %>% factor %>% relevel(ref = "FALSE"),
    flee = flee %>% factor %>% relevel(ref = "Not fleeing"))  %>%
  # Now inpute age with mean
  mutate(age = if_else(is.na(age), mean(age, na.rm = TRUE), age))

```

```{r}
# Write a function to get modal categories
tally = function(variable){
  dat %>%
    group_by(!!sym(variable)) %>%
    summarize(n = n(),
              percent = (n() / nrow(dat) * 100) %>% round(2))
}
```

# 2 Run models

```{r, echo = FALSE, message= FALSE, warning = FALSE}
# Run a model with all available data,

# Predicting likelihood of a shooting victim being white
mw <- dat2 %>%
    glm(formula = shot_while_white ~ date + state + armed + age + gender + 
        threat_level + flee + body_camera, family = "binomial")

# Predicting likelihood of a shooting victim being a person of color
mp <- dat2 %>%
    glm(formula = shot_while_poc ~ date + state + armed + age + gender + 
        threat_level + flee + body_camera, family = "binomial")

```


## 2.1 Test Variance Explained

There are dozens of reasons why police shoot citizens. It can be hard to distill any death down into one reason, let alone these eight reasons. For this reason, our PseudoR2 (eg. McFadden or Nagelkerke's) tend to be around 10-25%. However, these models are quite good at predicting whether you have an above 50% or below 50% chance of getting the outcome (being shot while black, shot while white, shot while poc, etc.)

This is depicted by Count R2. We will use this as a measure to *describe* peoples' different likelihood of being shot based on their race. 

```{r}
mw %>%
  PseudoR2() %>% round(2)
```

```{r}
mp %>% PseudoR2() %>% round(2)
```

## 2.2 Test for Multicolinearity

```{r}
mw %>% vif()
```


```{r}
mp %>% vif()
```

Looks good! 


## 2.3 Model Table

```{r}

library(texreg)
texreg::htmlreg(
  list(mp, mw),
  stars = c(0.001, 0.01, 0.05, 0.10),
  bold = 0.10,
  file = "table1.html",
  # Excludes beta coefficients for state, of which there are 50 (too many for chart)
  omit.coef = "state",
  custom.model.names = c("Shot while Person of Color", "Shot while White"),
  custom.coef.map = list(
    "date" = "Date",
    "age" = "Age",
    "genderM" = "Male",
    "aremdunarmed" = "Unarmed",
    "armedhousehold tool" = "Carrying a Household Tool",
    "armedunknown weapon" = "Carrying an unknown weapon",
    "armedpotential weapon" = "Carrying a potential weapon",
    "threat_levelnot attack" = "No threat of attack",
    "fleeFleeing" = "Fleeing",
    "body_cameraTRUE" = "Officer wore Body Camera",
    "(Intercept)" = "Constant"),
  custom.note = "Statistically significant results indicated by *** for p < 0.001, ** for p < 0.01, * for p < 0.05, and . for p < 0.10. Model includes fixed effects dummy variables by state. These were excluded from the table to conserve space.",
#    custom =  "Predictors of Race among Victims of Fatal Shooting by Police"
)

```


```{r}
# What's the likelihood that you were Black if you had the following traits?
newdata <- dat2 %>%
  with(data.frame(date = seq(from = as.Date("2015-01-01"), to = as.Date("2019-12-31"), by = "days"),
                  state = "CA", 
                  armed = "unarmed",
                  age = 37.12883, # mean age, unimputed dat$age %>% mean(na.rm = TRUE)
                  gender = "M", 
                  threat_level = "not attack",
                  flee = "Not fleeing",
                  body_camera = "FALSE")) 

# Based on this, let's predict what happens when
# an officer isn't wearing a body camera, the threat level is low,
# the person is Male and unarmed, 
# and they live in the modal state with the mean age and mean date

# What's the likelihood that someone shot by the police is a person of color 
# if they have the aforementioned traits?

pw <- predict.glm(mw, newdata = newdata, se.fit = TRUE, type = "response") %>%
  as.data.frame() %>%
  bind_cols(newdata) %>%
  mutate(model = "White")
# 30% (surprisingly low, given their share of the population)

pp <- predict.glm(mp, newdata = newdata, se.fit = TRUE, type = "response") %>%
  as.data.frame() %>%
  bind_cols(newdata) %>%
  mutate(model = "People of Color")
# 58% (surprisingly high, given the small share of the population that persons of color make up)

```




```{r}
# This charts the cumulative probability of a person being shot by the police 
# while a person of color vs. while white
# over time, from 2015 to 2020, based on observed data.
bind_rows(
  pw, pp) %>%
  arrange(date) %>%
  group_by(model) %>%
  mutate(projection = fit %>% cumsum(),
         ymin = cumsum(fit + se.fit*1.96 ),
         ymax = cumsum(fit - se.fit*1.96)) %>%
  ungroup() %>%
  rename(`Victims by Race` = model) %>%
  ggplot(mapping = aes(x = date, y = projection, ymin = ymin, ymax = ymax, 
                       group = `Victims by Race`, color = `Victims by Race`, 
                       fill = `Victims by Race`, linetype = `Victims by Race`)) +
  geom_ribbon(alpha = 0.5) +
  geom_line() +
  theme_minimal() +
  theme(legend.position = "right",
        plot.caption = element_text(hjust = 0.5),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        panel.grid.minor = element_blank()) +
  labs(
    title = "People of Color are nearly 2 times as likely to be \n fatally shot by the police than White peers.",
    y = "Cumulative Probability of \n suffering a Fatal Shooting by Police",
       x = "Date",
       caption = "Lines represent cumulative predicted likelihood of being fatally shot by police given a specific race, \n with bands showing 95% confidence intervals. Predictions for an unarmed man neither attacking nor fleeing,\nwhere the officer was not wearing a body camera. Victim's age and state reflect the mean and mode of the data. \n Data from Washington Post 2015-2020. Learn more at www.joincampaignzero.org.") 

```

# 3. Summary

In summary, this analysis did the following:

First, I modeled the likelihood that someone fatally shot by the police was a Person of Color, and the likelihood that someone fatally shot by the police was White. This strategy was adopted because it is not possible to examine the counterfactual, why do the police shoot some people but not others. After all, we don't really have a list of all the people police have not shot. Instead, I used these two models to examine this concept.

Second, each model controlled for the victim's age, the state of the shooting, the date of the shooting, whether the victim was armed, whether they attacked the officer, whether they fled, and whether the officer's body camera was on. I categorized these as best as I could using data from the Washington Post's dataset of 5401 fatal shootings by police from 2015 to 2020. None of these variables demonstrated problematic multicollinearity. Each model generated McFadden's and Nagelkerke's Pseudo R2 of 0.10~0.25. This is not a very high measure of variation explained, but that's also not surprising, considering that there are lots of other factors that might affect police shootings, including many traits about the police. However, even with just controlling for these main factors, these models generate a Count R2 measure of 0.65 ~ 0.75. This measures what percentage of victims shot does the model correctly predict were People of Color or not, after rounding the predicted probability for each respondent to 0 or 1. This is a fairly high score and indicates that we can generally speaking predict quite well the race of a fatally wounded victim in the dataset.

Third, we generated the predicted probability of an individual who was fatally shot by police being a person of color. We repeated this process for the probability that they were white. We used the aforementioned models to make these predictions, generating predicted probabilities for an unarmed person of average age and from the modal state (CA) in the dataset, who was not attacking or fleeing, and the officer was not wearing a body camera. In other words, we used the model to predict how likely a victim was to be a person of color if they did nothing to precipitate an attack from the police - a truly and indisputably innocent death.

Then, we generated 95% confidene intervals from these predictions using the standard error for each day from 2015 to 2020. This allows us to measure the cumulative predicted probability of a person shot by the police being a person of color over time.

For example, the outright probability of a white person with the traits described above being shot by the police, this analysis found, was 30%. But the probability of a person of color being shot by the police was 58%. If we combine these daily predictions over time, the probability of a person being shot by the police by day 2 would double from 58% to 116%, for example, and so on and so own. This highlights the increasing propensity of being harmed by the police for a person of color over time.


This data came from the Washington Post's collection of reported fatal shootings by police in the US from 2015 to 2020. The data is contained here: https://github.com/washingtonpost/data-police-shootings.

Learn more about how we can end police violence against people of color and all members of our communities at www.joincampaignzero.org.
